BAIT 509 Assignment 2: Preprocessing, Pipelines and Hyperparameter Tuning¶
Evaluates: Lectures 4 - 6.
Rubrics: Your solutions will be assessed primarily on the accuracy of your coding, as well as the clarity and correctness of your written responses. The MDS rubrics provide a good guide as to what is expected of you in your responses to the assignment questions and how the TAs will grade your answers. See the following links for more details:
- mechanics_rubric: how to submit an assignment correctly.
- accuracy rubric: how your code is evaluated.
- reasoning rubric: how your written responses are evaluated.
- autograde rubric: how right-or-wrong questions are evaluated (either manually or automatically).
Tidy Submission¶
rubric={mechanics:2}
- Complete this assignment by filling out this jupyter notebook.
- Any place you see `...` or `____`, you must fill in the function, variable, or data to complete the code.
- Use proper English, spelling, and grammar.
- You will submit two files on Canvas:
  - This jupyter notebook file containing your responses (an `.ipynb` file); and,
  - An `.html` file of your completed notebook that will render directly on Canvas without having to be downloaded.
    - To generate this html file you can click `File -> Export Notebook As -> HTML` in JupyterLab, or type the following into a terminal: `jupyter nbconvert --to html_embed assignment.ipynb`
Submit your assignment through UBC Canvas by the deadline listed there.
Introduction and learning goals ¶
Welcome to the assignment! In this assignment, you will practice:
- Identifying when to implement feature transformations such as imputation and scaling.
- Applying `sklearn.pipeline.Pipeline` to build a machine learning pipeline.
- Using `sklearn` to apply numerical feature transformations to the data.
- Identifying when it's appropriate to apply ordinal encoding vs one-hot encoding.
- Explaining strategies to deal with categorical variables with too many categories.
- Using `ColumnTransformer` to combine all our transformations into one object and use it with `scikit-learn` pipelines.
- Carrying out hyperparameter optimization using `sklearn`'s `GridSearchCV` and `RandomizedSearchCV`.
Introduction ¶
A crucial step when using machine learning algorithms on real-world datasets is preprocessing. This assignment will give you some practice to build a preliminary supervised machine learning pipeline on a real-world dataset.
Exercise 1: Introducing and Exploring the dataset ¶
In this assignment, you will be working on a sample of the adult census dataset that we provide as census.csv. We have made some modifications to this data so that it's easier to work with.
This is a classification dataset and the classification task is to predict whether income exceeds 50K per year or not based on the census data. You can find more information on the dataset and features here.
Note that many popular datasets have sex as a feature where the possible values are male and female. This representation reflects how the data were collected and is not meant to imply that, for example, gender is binary.
import pandas as pd
census_df = pd.read_csv("census.csv")
census_df.head()
| age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | income | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 74 | State-gov | 88638 | Doctorate | 16 | Never-married | Prof-specialty | Other-relative | White | Female | 0 | 3683 | 20 | United-States | >50K |
| 1 | 41 | Private | 70037 | Some-college | 10 | Never-married | Craft-repair | Unmarried | White | Male | 0 | 3004 | 60 | ? | >50K |
| 2 | 45 | Private | 172274 | Doctorate | 16 | Divorced | Prof-specialty | Unmarried | Black | Female | 0 | 3004 | 35 | United-States | >50K |
| 3 | 38 | Self-emp-not-inc | 164526 | Prof-school | 15 | Never-married | Prof-specialty | Not-in-family | White | Male | 0 | 2824 | 45 | United-States | >50K |
| 4 | 52 | Private | 129177 | Bachelors | 13 | Widowed | Other-service | Not-in-family | White | Female | 0 | 2824 | 20 | United-States | >50K |
1.1 Data splitting¶
rubric={accuracy:2}
To avoid violation of the golden rule, the first step before we do anything is splitting the data.
Split the data into train_df (80%) and test_df (20%). Keep the target column (income) in the splits so that we can use it in EDA.
Please use random_state=893, so that your results are consistent with what we expect.
from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(census_df, test_size=0.2, train_size=0.8, random_state=893)
Let's examine our train_df;
you can just follow along for the next few cells.
train_df
| age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | income | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 11523 | 31 | Self-emp-not-inc | 30290 | HS-grad | 9 | Never-married | Other-service | Unmarried | White | Female | 0 | 0 | 40 | United-States | <=50K |
| 2234 | 39 | State-gov | 122011 | Masters | 14 | Married-civ-spouse | Prof-specialty | Wife | White | Female | 5178 | 0 | 38 | United-States | >50K |
| 3531 | 63 | Self-emp-not-inc | 125178 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
| 4597 | 35 | State-gov | 89040 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
| 10095 | 27 | Private | 176972 | Assoc-voc | 11 | Never-married | Craft-repair | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7086 | 32 | Private | 226696 | Bachelors | 13 | Never-married | Exec-managerial | Not-in-family | White | Male | 0 | 0 | 55 | United-States | >50K |
| 3620 | 44 | Local-gov | 254146 | Masters | 14 | Married-civ-spouse | Prof-specialty | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
| 14968 | 30 | Private | 252752 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | White | Female | 0 | 0 | 40 | United-States | <=50K |
| 1628 | 46 | Private | 243190 | HS-grad | 9 | Married-civ-spouse | Adm-clerical | Wife | White | Female | 7688 | 0 | 40 | United-States | >50K |
| 11762 | 31 | Self-emp-inc | 83748 | Some-college | 10 | Married-civ-spouse | Exec-managerial | Wife | Asian-Pac-Islander | Female | 0 | 0 | 70 | South | <=50K |
12545 rows × 15 columns
train_df.info()
<class 'pandas.core.frame.DataFrame'> Index: 12545 entries, 11523 to 11762 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 age 12545 non-null int64 1 workclass 12545 non-null object 2 fnlwgt 12545 non-null int64 3 education 12545 non-null object 4 education_num 12545 non-null int64 5 marital_status 12545 non-null object 6 occupation 12545 non-null object 7 relationship 12545 non-null object 8 race 12545 non-null object 9 sex 12545 non-null object 10 capital_gain 12545 non-null int64 11 capital_loss 12545 non-null int64 12 hours_per_week 12545 non-null int64 13 native_country 12545 non-null object 14 income 12545 non-null object dtypes: int64(6), object(9) memory usage: 1.5+ MB
It looks like things are in order, but there is a hidden gotcha with this dataframe. Let's look at the unique values of each column.
from IPython.display import HTML # This step is just to avoid the long columns being truncated as "..."
HTML(
train_df
.select_dtypes(object)
.apply(lambda x: sorted(pd.unique(x)))
.to_frame()
.to_html()
)
| 0 | |
|---|---|
| workclass | [?, Federal-gov, Local-gov, Never-worked, Private, Self-emp-inc, Self-emp-not-inc, State-gov, Without-pay] |
| education | [10th, 11th, 12th, 1st-4th, 5th-6th, 7th-8th, 9th, Assoc-acdm, Assoc-voc, Bachelors, Doctorate, HS-grad, Masters, Preschool, Prof-school, Some-college] |
| marital_status | [Divorced, Married-AF-spouse, Married-civ-spouse, Married-spouse-absent, Never-married, Separated, Widowed] |
| occupation | [?, Adm-clerical, Armed-Forces, Craft-repair, Exec-managerial, Farming-fishing, Handlers-cleaners, Machine-op-inspct, Other-service, Priv-house-serv, Prof-specialty, Protective-serv, Sales, Tech-support, Transport-moving] |
| relationship | [Husband, Not-in-family, Other-relative, Own-child, Unmarried, Wife] |
| race | [Amer-Indian-Eskimo, Asian-Pac-Islander, Black, Other, White] |
| sex | [Female, Male] |
| native_country | [?, Cambodia, Canada, China, Columbia, Cuba, Dominican-Republic, Ecuador, El-Salvador, England, France, Germany, Greece, Guatemala, Haiti, Honduras, Hong, Hungary, India, Iran, Ireland, Italy, Jamaica, Japan, Laos, Mexico, Nicaragua, Outlying-US(Guam-USVI-etc), Peru, Philippines, Poland, Portugal, Puerto-Rico, Scotland, South, Taiwan, Thailand, Trinadad&Tobago, United-States, Vietnam, Yugoslavia] |
| income | [<=50K, >50K] |
You can see that there are question marks in the columns "workclass", "occupation", and "native_country".
Unfortunately, it seems the people collecting this data used a non-conventional way to indicate missing/unknown values
instead of the standard blank/NaN.
Our first step is to convert these manually,
so that `?` is not interpreted as an actual value by our models.
import numpy as np
train_df_nan = train_df.replace("?", np.nan)
test_df_nan = test_df.replace("?", np.nan)
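To see what this conversion buys us, here is a minimal sketch on a tiny hypothetical frame (the column names mirror the census data, but the values are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame (not the actual census sample)
toy = pd.DataFrame({
    "workclass": ["Private", "?", "State-gov"],
    "age": [25, 40, 60],
})

# After replacement, "?" becomes a real missing value that
# .isna() and sklearn imputers can detect
toy_nan = toy.replace("?", np.nan)
print(toy_nan.isna().sum())
```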
1.2 Describing your data¶
rubric={accuracy:2}
Use .describe() to show summary statistics of each feature in the train_df_nan dataframe.
Show the numerical and categorical columns separately
by reading the describe docstring to see which parameter to use.
# Numerical
train_df_nan.describe(include='int64')
| age | fnlwgt | education_num | capital_gain | capital_loss | hours_per_week | |
|---|---|---|---|---|---|---|
| count | 12545.000000 | 1.254500e+04 | 12545.000000 | 12545.000000 | 12545.000000 | 12545.000000 |
| mean | 40.579514 | 1.899356e+05 | 10.594420 | 1991.478198 | 123.427740 | 42.224552 |
| std | 12.870556 | 1.062844e+05 | 2.617251 | 10164.524089 | 478.764687 | 12.217008 |
| min | 17.000000 | 1.228500e+04 | 1.000000 | 0.000000 | 0.000000 | 1.000000 |
| 25% | 31.000000 | 1.178720e+05 | 9.000000 | 0.000000 | 0.000000 | 40.000000 |
| 50% | 40.000000 | 1.779950e+05 | 10.000000 | 0.000000 | 0.000000 | 40.000000 |
| 75% | 49.000000 | 2.363910e+05 | 13.000000 | 0.000000 | 0.000000 | 50.000000 |
| max | 90.000000 | 1.484705e+06 | 16.000000 | 99999.000000 | 3900.000000 | 99.000000 |
# Categorical
train_df_nan.describe(include='object')
| workclass | education | marital_status | occupation | relationship | race | sex | native_country | income | |
|---|---|---|---|---|---|---|---|---|---|
| count | 11994 | 12545 | 12545 | 11993 | 12545 | 12545 | 12545 | 12309 | 12545 |
| unique | 8 | 16 | 7 | 14 | 6 | 5 | 2 | 40 | 2 |
| top | Private | HS-grad | Married-civ-spouse | Exec-managerial | Husband | White | Male | United-States | <=50K |
| freq | 8501 | 3567 | 7427 | 2107 | 6577 | 10933 | 9161 | 11283 | 6309 |
1.3 Identifying potentially important features¶
rubric={reasoning:2}
We have provided you with some code that will visualize the distributions of all the numeric and categorical features in the census data. Study the visualizations below and suggest which features seem relevant for the given prediction task of building a model to identify who makes over or under 50K. List these features and briefly explain your rationale for selecting them.
YOUR ANSWER HERE
import altair as alt
alt.data_transformers.disable_max_rows() # Allows us to plot big datasets
alt.Chart(train_df.sort_values('income')).mark_bar(opacity=0.6).encode(
alt.X(alt.repeat(), type='quantitative', bin=alt.Bin(maxbins=50)),
alt.Y('count()', stack=None),
alt.Color('income')
).properties(
height=200
).repeat(
train_df_nan.select_dtypes('number').columns.to_list(),
columns=2
)
alt.Chart(train_df.sort_values('income')).mark_bar(opacity=0.6).encode(
alt.X(alt.repeat(), type='nominal'),
alt.Y('count()', stack=None),
alt.Color('income')
).properties(
height=200
).repeat(
train_df_nan.select_dtypes('object').columns.to_list(),
columns=1
)
- For numerical data: the age, education_num, and hours_per_week columns show a clear difference between the two income classes.
- For categorical data: education, marital_status, occupation, relationship, and sex show a clear difference between the income classes.
- Both selections are based on how much the two class distributions overlap: the less they overlap, the more useful the feature is likely to be for separating the classes.
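One way to back up these visual judgments with numbers is to compare the per-class means of the numeric features. This is a minimal sketch on a tiny hypothetical sample, not the real census data:

```python
import pandas as pd

# Hypothetical mini-sample; column names mirror the census data
toy = pd.DataFrame({
    "income": ["<=50K", "<=50K", ">50K", ">50K"],
    "age": [22, 30, 45, 50],
    "hours_per_week": [35, 40, 50, 60],
})

# A large gap between the two rows for a feature suggests its
# class distributions overlap less
print(toy.groupby("income").mean())
```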
1.4 Separating feature vectors and targets¶
rubric={accuracy:2}
Create X_train, y_train, X_test, y_test from train_df_nan and test_df_nan.
X_train = train_df_nan.drop('income', axis=1)
y_train = train_df_nan['income']
X_test = test_df_nan.drop('income', axis=1)
y_test = test_df_nan['income']
1.5 Training?¶
rubric={reasoning:2}
If you train sklearn's SVC model on X_train and y_train at this point, would it work? Why or why not?
No, that will not work.

- Since a Support Vector Machine uses distances to decide on the hyperplane, unscaled features contribute to those distances unevenly; the features with the largest magnitudes will dominate the result.
- There are NaN values that SVC cannot handle.
- There are non-numerical (categorical) features that SVC cannot handle.
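As a quick check of the missing-value point, a minimal sketch (with made-up data) showing that SVC raises an error when fit on raw data containing NaN:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 2.0], [np.nan, 3.0]])  # one missing value
y = np.array([0, 1])

try:
    SVC().fit(X, y)
except ValueError as err:
    print("SVC refused the raw data:", err)
```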
Exercise 2: Preprocessing ¶
In this exercise, you'll be wrangling the dataset so that it's suitable to be used with scikit-learn classifiers.
2.1 Identifying transformations that need to be applied¶
rubric={reasoning:7}
Identify the columns on which transformations need to be applied and tell us what transformation you would apply in what order by filling in the table below. Example transformations are shown for the feature age in the table.
Note that for this problem, no ordinal encoding will be executed on this dataset.
Are there any columns that you think should be dropped from the features? If so, explain your answer.
| Feature | Transformation |
|---|---|
| age | imputation, scaling |
| workclass | imputation, OHE |
| fnlwgt | imputation, scaling |
| education | imputation, OHE |
| education_num | imputation, scaling |
| marital_status | imputation, OHE |
| occupation | imputation, OHE |
| relationship | imputation, OHE |
| race | imputation, OHE |
| sex | imputation, OHE |
| capital_gain | imputation, scaling |
| capital_loss | imputation, scaling |
| hours_per_week | imputation, scaling |
| native_country | imputation, OHE |
The education column should be dropped because it contains the same information as education_num.
2.2 Numeric vs. categorical features¶
rubric={reasoning:2}
Since we will apply different preprocessing steps on the numerical and categorical columns, we first need to identify the numeric and categorical features and create lists for each of them (make sure not to include the target column).
Save the column names as string elements in each of the corresponding list variables below.
numeric_features = ['age', 'fnlwgt', 'education_num', 'capital_gain', 'capital_loss',
'hours_per_week']
categorical_features = ['workclass', 'education', 'marital_status', 'occupation',
'relationship', 'race', 'sex', 'native_country']
2.3 Numeric feature pipeline¶
rubric={accuracy:2}
Let's start making our pipelines. Use make_pipeline() or Pipeline() to make a pipeline for the numeric features called numeric_transformer.
This pipeline will only have one step,
the StandardScaler(),
so technically we didn't need to make a pipeline,
but it is good to be in the habit of working with pipelines
and it also gives us the option to name this step if we want.
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
numeric_transformer = Pipeline(
    steps=[
        ("scaler", StandardScaler())
    ]
)
2.4 Categorical feature pipeline¶
rubric={accuracy:2}
Next, make a pipeline for the categorical features called categorical_transformer.
To keep things simple,
we will impute on all columns,
including those where we did not find missing values in the training data.
Use SimpleImputer() with strategy='most_frequent'.
Add a OneHotEncoder as the second step and configure it to ignore unknown values in the test data.
from sklearn.preprocessing import OneHotEncoder
categorical_transformer = Pipeline(
    steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('encode', OneHotEncoder(handle_unknown='ignore'))
    ]
)
2.5 ColumnTransformer¶
rubric={accuracy:2}
Create a column transformer that applies our numeric pipeline transformations to the numeric feature columns
and our categorical pipeline transformations to the categorical feature columns.
Assign this column transformer to the variable preprocessor.
from sklearn.compose import ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ("numeric", numeric_transformer, numeric_features),
        ("categorical", categorical_transformer, categorical_features)
    ],
    remainder='passthrough'
)
Exercise 3: Building a Model ¶
3.1 Dummy Classifier¶
rubric={accuracy:3}
Now that we have our preprocessing pipeline set up,
let's move on to the model building.
First,
it's important to build a dummy classifier to establish a baseline score to compare our model to.
Make a DummyClassifier that predicts the most common label, train it, and then score it on the training and test sets
(in two separate cells so that both scores are displayed).
from sklearn.dummy import DummyClassifier
dummy = DummyClassifier(strategy='most_frequent')
preprocessor.fit(X_train)
dummy.fit(preprocessor.transform(X_train), y_train)
dummy.score(preprocessor.transform(X_train), y_train)
0.5029095257074532
dummy.score(preprocessor.transform(X_test),y_test)
0.48836467963021996
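The score above makes sense: with strategy='most_frequent', the dummy's accuracy is simply the share of the majority class. A minimal sketch with made-up labels:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# The features are ignored entirely by the dummy classifier
X = np.zeros((4, 1))
y = np.array(["<=50K", "<=50K", "<=50K", ">50K"])

dummy = DummyClassifier(strategy="most_frequent").fit(X, y)
print(dummy.score(X, y))  # 3 of 4 labels are "<=50K" -> 0.75
```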
3.2 Main pipeline¶
rubric={accuracy:2}
Define a main pipeline that transforms all the different features and uses an SVC model with default hyperparameters.
If you are using Pipeline instead of make_pipeline, name each of your steps columntransformer and svc respectively.
from sklearn.svm import SVC
main_pipe = Pipeline(
    steps=[
        ("columntransformer", preprocessor),
        ("svc", SVC())
    ]
)
3.3 Hyperparameter tuning/optimization¶
rubric={accuracy:3}
Now that we have our pipelines and a model, let's tune the hyperparameters gamma and C.
For this tuning,
construct a grid where each hyperparameter can take the values 0.1, 1, 10, 100
and randomly search for the best combination.
To save some running time on your laptops,
use 3-fold cross-validation to evaluate each result
and only search for 7 iterations,
and set n_jobs=-1.
Return the train and test scores,
set random_state=289,
and optionally verbose=2 if you want to see information as the search is occurring.
Don't forget to fit the best model from the RandomizedSearchCV object
on all the training data as the final step.
This search is quite demanding computationally so be prepared for this to take 2 or 3 minutes and your fan may start to run!
from sklearn.model_selection import RandomizedSearchCV
param_grid = {
    "svc__gamma": [0.1, 1.0, 10, 100],
    "svc__C": [0.1, 1.0, 10, 100],
}

random_search = RandomizedSearchCV(
    main_pipe,
    param_grid,
    n_iter=7,
    cv=3,
    n_jobs=-1,
    random_state=289,
    return_train_score=True,  # return the train scores, as instructed
    verbose=2,
)
random_search.fit(X_train, y_train)
random_search.best_params_
Fitting 3 folds for each of 7 candidates, totalling 21 fits [CV] END .........................svc__C=0.1, svc__gamma=0.1; total time= 2.9s [CV] END .........................svc__C=0.1, svc__gamma=0.1; total time= 3.1s [CV] END .........................svc__C=0.1, svc__gamma=0.1; total time= 3.1s [CV] END ..........................svc__C=10, svc__gamma=1.0; total time= 6.6s [CV] END ..........................svc__C=10, svc__gamma=1.0; total time= 7.1s [CV] END ..........................svc__C=10, svc__gamma=1.0; total time= 7.5s [CV] END ..........................svc__C=1.0, svc__gamma=10; total time= 7.6s [CV] END ..........................svc__C=1.0, svc__gamma=10; total time= 8.1s [CV] END ..........................svc__C=1.0, svc__gamma=10; total time= 8.9s [CV] END ..........................svc__C=10, svc__gamma=0.1; total time= 2.8s [CV] END ..........................svc__C=10, svc__gamma=0.1; total time= 2.9s [CV] END ..........................svc__C=10, svc__gamma=0.1; total time= 4.1s [CV] END .........................svc__C=100, svc__gamma=0.1; total time= 4.5s [CV] END .........................svc__C=100, svc__gamma=1.0; total time= 9.5s [CV] END .........................svc__C=100, svc__gamma=1.0; total time= 6.3s [CV] END .........................svc__C=100, svc__gamma=1.0; total time= 7.2s [CV] END .........................svc__C=100, svc__gamma=0.1; total time= 3.8s [CV] END .........................svc__C=100, svc__gamma=0.1; total time= 5.1s [CV] END ...........................svc__C=10, svc__gamma=10; total time= 8.0s [CV] END ...........................svc__C=10, svc__gamma=10; total time= 8.5s [CV] END ...........................svc__C=10, svc__gamma=10; total time= 8.2s
{'svc__gamma': 0.1, 'svc__C': 0.1}
3.4 Choosing your hyperparameters¶
rubric={accuracy:2, reasoning:1}
We are displaying the results from the random hyperparameter search
as a dataframe below.
Looking at this table,
which values for gamma and C would you choose for your final model and why?
You can answer this either manually, by using the table,
or by accessing the corresponding attributes of the random search object.
pd.DataFrame(random_search.cv_results_)[["params", "mean_test_score", "rank_test_score"]]
| params | mean_test_score | rank_test_score | |
|---|---|---|---|
| 0 | {'svc__gamma': 10, 'svc__C': 1.0} | 0.592109 | 7 |
| 1 | {'svc__gamma': 1.0, 'svc__C': 10} | 0.727940 | 4 |
| 2 | {'svc__gamma': 0.1, 'svc__C': 0.1} | 0.814269 | 1 |
| 3 | {'svc__gamma': 1.0, 'svc__C': 100} | 0.719012 | 5 |
| 4 | {'svc__gamma': 0.1, 'svc__C': 10} | 0.811957 | 2 |
| 5 | {'svc__gamma': 0.1, 'svc__C': 100} | 0.782623 | 3 |
| 6 | {'svc__gamma': 10, 'svc__C': 10} | 0.595537 | 6 |
I will choose gamma = 0.1 and C = 0.1 as my parameter set, because this combination has the highest mean test score (rank 1 in the table).
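The same answer can be read off programmatically from the search object's best_params_ and best_score_ attributes. A self-contained sketch on a small synthetic problem (so it runs without the census data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Small synthetic problem so the search runs in a fraction of a second
X, y = make_classification(n_samples=100, random_state=0)

search = RandomizedSearchCV(
    SVC(),
    {"gamma": [0.1, 1.0], "C": [0.1, 1.0]},
    n_iter=4,
    cv=3,
    random_state=0,
)
search.fit(X, y)

# best_params_ holds the winning combination;
# best_score_ is its mean cross-validation score
print(search.best_params_, search.best_score_)
```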
4. Evaluating on the test set ¶
Now that we have a best-performing model, it's time to assess our model on the test set.
4.1 Scoring your final model¶
rubric={accuracy:2}
What is the training and test score of the best scoring model? Score the model in two separate cells so that both the training and test scores are displayed.
random_search.score(X_train,y_train)
0.8255878836189717
random_search.score(X_test,y_test)
0.8202103920943576
4.2 Assessing your model¶
rubric={reasoning:2}
Compare your final model's accuracy with your baseline model from question 3.1. Do you consider our model to be performing better than the baseline to such an extent that you would prefer it on deployment data?
Briefly describe one aspect of our model development in this notebook that either supports your confidence in the model we have, or one possible improvement to what we did here that you think could have increased our model score.
The tuned model performs clearly better than the baseline model (roughly 0.82 vs. 0.49 test accuracy), so I would prefer it on deployment data. Aspects of our model development that support confidence in it:

- We dealt with numeric and categorical data separately.
- We handled missing (NaN) values through imputation.
- We scaled the numeric features and encoded the categorical ones.
- We fine-tuned the hyperparameters with cross-validation.
Submission to Canvas¶
PLEASE READ: When you are ready to submit your assignment, do the following:

- Read through your solutions.
- Restart your kernel, clear all output, and rerun your cells from top to bottom.
- Make sure that none of your code is broken.
- Convert your notebook to `.html` format by going to File -> Export Notebook As... -> Export Notebook to HTML.
- Upload your `.ipynb` file and the `.html` file to Canvas under Assignment 2.
- DO NOT upload any `.csv` files.